39 research outputs found

    Piecewise Linear Approximations of Digitized Space Curves with Applications

    Low Degree Approximation of Surfaces for Revolved Objects

    Visualization-specific compression of large volume data

    RLFC: Random Access Light Field Compression using Key Views and Bounded Integer Encoding

    We present a new hierarchical compression scheme for encoding light field images (LFI) that is suitable for interactive rendering. Our method (RLFC) exploits redundancies in the light field images by constructing a tree structure. The top level (root) of the tree captures the common high-level details across the LFI, and the other levels (children) capture specific low-level details of the LFI. Our decompression algorithm corresponds to tree traversal operations and gathers the values stored at different levels of the tree. Furthermore, we use bounded integer sequence encoding, which provides random access and fast hardware decoding, to compress the blocks of the children of the tree. We have evaluated our method on 4D two-plane parameterized light fields. The compression rates vary from 0.08 to 2.5 bits per pixel (bpp), resulting in compression ratios of around 200:1 to 20:1 for a PSNR quality of 40 to 50 dB. The decompression times for decoding the blocks of LFI are 1 to 3 microseconds per channel on an NVIDIA GTX-960, and we can render new views with a resolution of 512x512 at 200 fps. Our overall scheme is simple to implement and involves only bit manipulations and integer arithmetic operations.
    Comment: Accepted for publication at the Symposium on Interactive 3D Graphics and Games (I3D '19)

    Chapter 13 On the Efficient Implementation of a Real-time Kd-tree Construction Algorithm

    Abstract: The kd-tree is one of the most commonly used spatial data structures for a variety of graphics applications because of its reliably high acceleration performance. Several years ago, Zhou et al. devised an effective kd-tree construction algorithm that runs entirely on a GPU. In this chapter, we present improved GPU programming techniques for implementing the algorithm more efficiently on current GPUs. One of the major ideas is to reduce the number of necessary kernel functions by replacing the essential segmented-scan and reduction computations with simpler per-block atomic operations, thereby alleviating the overheads from multiple synchronous kernel calls. Combined with an efficient implementation of intrablock scan and reduction using recently introduced intrinsic functions, these changes markedly improve the performance of the kd-tree construction process. Through an example of real-time ray tracing for dynamic scenes of nontrivial complexity, we demonstrate that the proposed GPU techniques can be exploited effectively for various real-time applications.
    Background and our contribution
    For many important applications in computer graphics, such as ray tracing and those relying on particle-based computations, adopting a proper acceleration structure greatly affects run-time performance. Among the variety of spatial data structures, the kd-tree is frequently used because of its reliably high acceleration performance. Compared to other techniques such as grids and bounding-volume hierarchies, its relatively higher construction cost has been regarded as a drawback, despite efforts to develop an optimized algorithm (e.g., [1] and Wu et al. In this chapter, we present enhanced CUDA programming techniques for implementing the GPU method of Zhou et al.
Optimizations for the large-node stage
In Zhou et al.'s method, the upper levels of the kd-tree were constructed using a node-splitting scheme that comprised spatial median splitting and empty-space maximizing. In particular, based on the observation that the assumptions made in the SAH may often be inaccurate for large nodes, this stage of computation, called the large-node stage, simply selects the spatial median of the longest axis of the axis-aligned bounding box (AABB) of a node as its split position. For efficient parallel implementation on a GPU, all triangles in each large node are grouped into chunks of fixed size (i.e., 256), parallelizing the computation over the triangles in the chunks. (Note that the triangles and chunks are mapped to the threads and blocks, respectively, in the CUDA implementation.)
Triangle sorting with respect to splitting planes
The large-node stage iterates the node-splitting process until no large node is left. In Algorithm 2 [11], the most time-consuming parts of each iteration are the fourth and fifth steps, corresponding to lines 24-34 and 35-40, respectively, where the triangles for each large node are first sorted with respect to the splitting plane, and the triangle numbers of the resulting two child nodes are then counted. In this subsection, we present two different approaches to implementing these two steps on a GPU. We then analyze their performance in the section on experimental results.
Implementation using standard data-parallel primitives
As was done in For each triangle in a large node, mapped to a CUDA thread, the key issue is how to efficiently calculate its address(es) in parallel in the new triangle index list (next list), whose production is complicated because of the simultaneous subdivisions of the large nodes in the current list (active list).
For this, a kernel is first executed over every thread block corresponding to a chunk of triangles, classifying each triangle against the respective splitting plane, and generating two bit-flag sequences of size 256 per chunk (triangle bit flags). Then, for each of these, an exclusive scan is performed using the shared memory of the GPU, resulting in the local triangle offset sequences. In addition, the kernel counts the number of triangles in each bit-flag sequence by simple addition, and places this number in an array in the global memory. (Note that, for the example in
Implementation using atomic operations
The triangle-sorting technique described in the previous subsection requires a segmented scan to be carried out twice on the data sequences stored in the global memory, and can easily be implemented using the data-parallel primitive functions provided by the CUDPP library [2], for example. Although very effective, such an approach forces the run-time execution to be split into a sequence of synchronous kernel calls, whose overheads adversely affect run-time performance. To address this, observe that a side effect of using a standard segmented-scan method is that the relative order of triangle indices within a large node made of multiple chunks is retained in the respective child nodes. Such a property is important when the order of elements is essential, as in a radix sort algorithm, for example. However, retaining the strict order is unnecessary in the kd-tree construction algorithm because the order of triangles within a kd-tree's leaf node is not critical in the later ray-tracing stage. This observation allows us to implement the triangle-sorting computation by using a single faster-running kernel and replacing the segmented-scan operations with simpler per-chunk atomic operations that are supported by the CUDA API.
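The idea in the preceding paragraph can be illustrated with a small sequential sketch in Python. It is not the chapter's CUDA kernel: `AtomicCounter`, `split_node`, the `extents` mapping, and the `chunk_order` parameter are hypothetical names introduced here, and the rule for which side a triangle falls on is an assumption. Each loop iteration plays the role of one thread block: it scans its own bit flags locally, then reserves a contiguous range in each child list with a single fetch-and-add.

```python
CHUNK_SIZE = 256  # chunk size quoted in the text


class AtomicCounter:
    """Stand-in for a global-memory counter that a GPU would update atomically."""

    def __init__(self):
        self.value = 0

    def fetch_add(self, amount):
        old = self.value
        self.value += amount
        return old


def exclusive_scan(flags):
    """Per-chunk exclusive prefix sum (done in shared memory on a GPU)."""
    out, running = [], 0
    for flag in flags:
        out.append(running)
        running += flag
    return out


def split_node(tri_ids, extents, split_pos, chunk_order=None):
    """Partition one node's triangles into [left_child, right_child] index lists.

    extents[t] is the (min, max) of triangle t along the split axis.
    chunk_order emulates the arbitrary order in which thread blocks may run.
    """
    chunks = [tri_ids[i:i + CHUNK_SIZE] for i in range(0, len(tri_ids), CHUNK_SIZE)]
    order = chunk_order if chunk_order is not None else range(len(chunks))
    children = [[None] * len(tri_ids), [None] * len(tri_ids)]  # worst-case size
    counters = [AtomicCounter(), AtomicCounter()]
    for c in order:
        chunk = chunks[c]
        # Assumed classification rule: a straddling triangle goes to both children.
        flags = [
            [1 if extents[t][0] <= split_pos else 0 for t in chunk],
            [1 if extents[t][1] > split_pos else 0 for t in chunk],
        ]
        for side in (0, 1):
            offsets = exclusive_scan(flags[side])
            # One fetch-and-add per chunk and child reserves the output range.
            base = counters[side].fetch_add(sum(flags[side]))
            for lane, t in enumerate(chunk):
                if flags[side][lane]:
                    children[side][base + offsets[lane]] = t
    return [children[s][:counters[s].value] for s in (0, 1)]
```

Running the same input with two different values of `chunk_order` yields child lists with the same members but possibly a different order between chunks, which is the relaxation the text relies on.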
In the new implementation, the memory configuration for the triangle index lists is slightly different, as shown in For each chunk of triangle indices in the current list, the new kernel repeats the same computation until the triangle numbers are calculated in the array [A]. A representative thread then carries out two atomic additions, one for each child node, each fetching the local offset from the corresponding atomic variable and simultaneously adding the triangle count to it, which tells us where to start storing the sorted triangle indices in the child nodes. Then, once per child node, each thread checks the corresponding bit flag in the triangle bit flag array and, if it is set, puts its triangle index in the proper place in the next triangle index list, whose location can easily be deduced from the fetched offset and the offset in the triangle offsets array. In this implementation, the two segmented scans over the arrays in the global memory have been replaced by two atomic-add operations per thread block. While the computation time is already reduced markedly by this change, two per-block scans, one for each child, must still be carried out per chunk to compute the triangle offsets. While such scans can be performed effectively in the shared memory by using a standard scan method
AABB computations for active large nodes
Another time-consuming part of the large-node stage is the second step (lines 9 to 14 of Algorithm 2), in which the AABB of all triangles in each node is calculated. The optimization techniques described in the previous subsection can also be applied to this AABB computation. The standard reduction in the shared memory for computing per-chunk bounding boxes can be implemented more efficiently on the GPU by a simple modification of the scan implementation using the intrinsic shuffle function __shfl_up().
Then, via three pairs of atomic min and max operations, the result of each chunk reduction is written in parallel to the location in the global memory that corresponds to the large node to which the chunk belongs. Although such atomic operations are still regarded as expensive on current GPUs, we observe that our single-kernel implementation based on atomic operations runs significantly faster on the GPU than the original implementation, which needed to perform segmented reductions six times.
Optimizations for the small-node stage
After all large nodes are split into nodes whose triangle numbers do not exceed 64, the small-node stage starts. Because sufficient nodes are available, the computation in this stage is parallelized over nodes instead of triangles, evaluating the precise SAH metric to find the best splitting plane for each small node. The key to the efficient implementation of this stage is exploiting a preprocessed data structure that facilitates the iterative node-splitting process. For each initial small node, called the small root node, up to 384 (= 64 (triangles) * 3 (x-, y-, z-axes) * 2 (min/max)) splitting-plane candidates are first collected from triangles in the node. Then, for each candidate, two 8-byte bit masks are generated to represent the triangle sets contained on each side. To represent this information, 20 bytes of memory per candidate are necessary, including the 4 bytes used to store the location of the splitting plane, implying that up to 7,680 (= 20 * 384) bytes of memory may be necessary for each small root node. It is important to choose an appropriate memory layout for the representation because the nontrivial amount of data will be accessed in parallel during the small-node stage.
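The candidate representation described above can be sketched in a few lines of Python. This is an illustrative CPU version, not the chapter's GPU data layout: `build_candidates`, `popcount`, and `storage_bytes` are hypothetical names, per-triangle AABBs are assumed as input, and the handling of a flat triangle lying exactly in a candidate plane is an assumption. Each candidate stores a plane position and two 64-bit masks, so 20 bytes (4 for the position plus two 8-byte masks) times 384 candidates gives the 7,680 bytes quoted in the text.

```python
TRIANGLES_PER_SMALL_NODE = 64          # upper bound quoted in the text
BYTES_PER_CANDIDATE = 4 + 8 + 8        # plane position + two 8-byte bit masks


def build_candidates(aabbs):
    """Return (position, left_mask, right_mask) for every splitting-plane candidate.

    aabbs: list of ((xmin, ymin, zmin), (xmax, ymax, zmax)), one box per triangle.
    Bit i of a mask is set when triangle i overlaps that side of the plane.
    """
    if len(aabbs) > TRIANGLES_PER_SMALL_NODE:
        raise ValueError("a small node holds at most 64 triangles")
    candidates = []
    for axis in range(3):                  # x, y, z
        for bound in (0, 1):               # min face, then max face
            for box in aabbs:
                pos = box[bound][axis]
                left_mask = right_mask = 0
                for i, (lo, hi) in enumerate(aabbs):
                    in_left = lo[axis] < pos
                    in_right = hi[axis] > pos
                    if not (in_left or in_right):
                        in_left = True     # assumption: in-plane triangle goes left
                    if in_left:
                        left_mask |= 1 << i
                    if in_right:
                        right_mask |= 1 << i
                candidates.append((pos, left_mask, right_mask))
    return candidates


def popcount(mask):
    """Number of triangles in a set encoded as a bit mask."""
    return bin(mask).count("1")


def storage_bytes(num_triangles):
    """Bytes needed for all candidates of one small root node."""
    return num_triangles * 3 * 2 * BYTES_PER_CANDIDATE
```

With this layout, the triangle count on either side of a candidate plane is just the number of set bits in the corresponding mask, and restricting a candidate to a child node is a bitwise AND with that child's triangle mask.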
Although several different configurations are possible, we observed that the combination of a 4-byte access from the global memory for the splitting plane location and another 16-byte access from the texture memory for the triangle sets incurred the lowest memory latency on the GPU tested. (Our analysis of the generated PTX code showed that 16 bytes of data were fetched from texture memory even for a 4-byte access command.) With this representation, the SAH cost evaluation and triangle sorting in the subsequent node-splitting step can be performed efficiently using simple bitwise operations. In this process, a parallel bit-counting operation is carried out very frequently to obtain the numbers of triangles in the child nodes. Whereas the method presented in
Experimental results
To measure the performance improvement achieved by the optimization techniques presented here, we first implemented the kd-tree construction algorithm of Zhou et al. on an NVIDIA GeForce GTX 680 GPU, effectively as described in the original paper. In doing this, we used the scan and reduction techniques described in
Concluding remarks
In this chapter, we have presented efficient GPU programming techniques for implementing the well-known kd-tree construction algorith

    Generation of Non-uniform Meshes for Finite-Difference Time-Domain Simulations

    Abstract: In this paper, two automatic mesh generation algorithms are presented. The methods seek to optimize mesh density with regard to geometries exhibiting both fine and coarse physical structures. When generating meshes, the algorithms attempt to satisfy the conditions on the maximum mesh spacing and the maximum grading ratio simultaneously. Both algorithms successfully produce non-uniform meshes that satisfy the requirements for finite-difference time-domain simulations of microwave components. Additionally, an algorithm successfully generates a minimum number of grid points while maintaining the simulation accuracy.
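The abstract does not spell out the two algorithms, so the sketch below only checks the two conditions it names, for a one-dimensional list of grid coordinates. The function name `satisfies_mesh_constraints` is hypothetical, and the grading ratio is assumed to mean the ratio between the sizes of adjacent cells, which is the usual meaning in FDTD meshing.

```python
def satisfies_mesh_constraints(points, max_spacing, max_grading_ratio):
    """Check a 1D non-uniform mesh against the two conditions in the abstract.

    points: strictly increasing grid coordinates along one axis.
    max_spacing: largest allowed cell size.
    max_grading_ratio: largest allowed size ratio between adjacent cells
        (assumed definition; must be >= 1).
    """
    spacings = [b - a for a, b in zip(points, points[1:])]
    if any(h <= 0 for h in spacings):
        raise ValueError("grid coordinates must be strictly increasing")
    spacing_ok = all(h <= max_spacing for h in spacings)
    grading_ok = all(
        max(h1, h2) / min(h1, h2) <= max_grading_ratio
        for h1, h2 in zip(spacings, spacings[1:])
    )
    return spacing_ok and grading_ok
```

For example, the grid `[0.0, 1.0, 2.5, 4.5]` has cell sizes 1.0, 1.5, and 2.0, so it passes with a maximum spacing of 2.0 and a grading ratio limit of 1.6, and fails if either limit is tightened to 1.8 or 1.2 respectively.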

    On Surface Design with Implicit Algebraic Surfaces (Ph.D. Thesis)

    Computer Aided Geometric Design (CAGD) is a rapidly growing area that involves theories and techniques from many disciplines, such as computer science, mathematics, and engineering. One of the most important subjects in CAGD is to efficiently model physical objects with a surface or collection of surfaces for applications in CAD/CAM, computer graphics, medical imaging, robotics, etc. Research in surface modeling has been largely dominated by the theory of parametrically represented surfaces. While they have been successfully used in representing physical objects, parametric surfaces run into problems when the objects they represent are manipulated in geometric modeling systems. In recent years, increasing attention has been paid to algebraic surfaces that are implicitly defined by a polynomial equation and provide a more general class of surfaces at lower degrees. In this thesis, we consider the problem of modeling complex geometric objects with smooth piecewise algebraic surface patches. We present an interpolation algorithm, called Hermite interpolation, which characterizes the class of all algebraic surfaces of a specified degree that interpolate given points and space curves with tangent plane continuity. The Hermite interpolation algorithm with least squares approximation transforms the geometric problem of algebraic surface design into a linear algebra problem which can be solved efficiently. Based on this algebraic model, we explore the class of quintic algebraic surfaces to smooth convex polyhedra with a mesh of smooth piecewise algebraic surface patches. Degrees of freedom in constructing wire frames for polyhedra are used to control the shapes of curved models of polyhedra. The open problem of modeling polyhedra having arbitrary shapes with quintic triangular algebraic surface patches is considered.
Finally, we present a heuristic algorithm which quickly computes a good piecewise linear approximation of a given digitized space curve. This algorithm serves as a primary tool in polygonizing triangular algebraic surface patches.
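The abstract does not describe the thesis's heuristic, so the sketch below swaps in a different, well-known technique, the Ramer-Douglas-Peucker algorithm, purely to illustrate what a piecewise linear approximation of a digitized space curve looks like. The names `point_segment_distance` and `simplify_curve` and the `tolerance` parameter are introduced here for illustration.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from 3D point p to the line segment from a to b."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    denom = sum(c * c for c in ab)
    if denom == 0:
        t = 0.0
    else:
        t = max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    closest = [a[i] + t * ab[i] for i in range(3)]
    return math.dist(p, closest)


def simplify_curve(points, tolerance):
    """Ramer-Douglas-Peucker simplification of a sampled 3D curve.

    Keeps the end points and, recursively, the sample farthest from the
    current segment whenever that distance exceeds the tolerance.
    """
    if len(points) < 3:
        return list(points)
    keep = [False] * len(points)
    keep[0] = keep[-1] = True
    stack = [(0, len(points) - 1)]
    while stack:
        first, last = stack.pop()
        worst, worst_distance = None, tolerance
        for i in range(first + 1, last):
            d = point_segment_distance(points[i], points[first], points[last])
            if d > worst_distance:
                worst, worst_distance = i, d
        if worst is not None:
            keep[worst] = True
            stack.append((first, worst))
            stack.append((worst, last))
    return [p for p, kept in zip(points, keep) if kept]
```

A smaller `tolerance` keeps more vertices and follows the digitized samples more closely; a larger one yields fewer, longer segments.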

    Hermite Interpolation of Rational Space Curves Using Real Algebraic Surfaces
